Reservoir sampling

Table of Contents

wiki

implemention

array R[k];    // result
integer i, j;

// fill the reservoir array
for each i in 1 to k do
    R[i] := S[i]
done;

// replace elements with gradually decreasing probability
for each i in k+1 to length(S) do
    j := random(1, i);   // important: inclusive range
    if j <= k then
        R[j] := S[i]
    fi
done

Relation to Fisher-Yates shuffle

To initialize an array a of n elements to a randomly shuffled copy of S, both 0-based:

a[0] ← S[0]
for i from 1 to n - 1 do
    r ← random (0 .. i)
    a[i] ← a[r]
    a[r] ← S[i]

Note that although the rest of the cards are shuffled, only the top k are important in the present context. Therefore, the array a need only track the cards in the top k positions while performing the shuffle, reducing the amount of memory needed. Truncating a to length k, the algorithm is modified accordingly:

To initialize an array a to k random elements of S (which is of length n), both 0-based:

a[0] ← S[0]
for i from 1 to k - 1 do
    r ← random (0 .. i)
    a[i] ← a[r]
    a[r] ← S[i]

for i from k to n - 1 do
    r ← random (0 .. i)

    if (r < k) then a[r] ← S[i]

Since the order of the first k cards is immaterial, the first loop can be removed and a can be initialized to be the first k items of S. This yields Algorithm R.

Proof of correctness1

After n ≥ k items in S have been processed, each of those items is sampled with probability s/n.

prove the theorem by induction. Basic step: for n = k the statement is obviously correct. Inductive step: assuming the correctness for n = m, next we show that the statement is also correct for n = m + 1.

第(m+1)个数是o, 前m个书中的o'

  • o被采样仅当为o产生的随机数x的范围在[1,m],o被采样的概率为s/(m+1)
  • o' 被采样(在处理数o后)仅当(i)在采样前m个数的过程后它任然在 (ii)为o产生的随机数x不等于数o'在数组的位置

通过归纳法得,(i)以概率s/m发生,(ii)以m/(m+1)概率发生,因为两事件独立,同时发生的概率(s/m) * (m/(m+1)) = s/(m+1)

Proof2

reservoir为R是简单的一个大小为s的数组,数据流是大小为n的D

对于第k+1个数,一个位置<=k的元素i在R中的概率为s/k.元素i被这第k+1数取代的概率是这第k+1个数被选中的概率乘以元素i被选中的概率,是s/(k+1) * 1/s = 1/(k+1), 即元素i不被取代的概率是k/(k+1).

所以在第k+1次取数后,任意给定的元素仍然在R中的概率是:(第k步被选中并且没有在第k+1步被移除) =s/k * k/(k+1) = s/(k+1). 所以,当k+1 = n时,任意元素在R中的概率是s/n.

Distributing Reservoir Sampling1

distribute the reservoir sampling algorithm to n nodes.

Split the data stream into n partitions, one for each node. Apply reservoir sampling with reservoir size s, the final reservoir size, on each of the partitions. Finally, aggregate each reservoir into a final reservoir sample by carrying out reservoir sampling on them.

Lets say you split data of size n into 2 nodes, where each partition is of size n/2. Sub-reservoirs R1 and R2 are each of size s.

Probability that a record will be in sub-reservoir is: s / (n/2) = 2s/n

The Probability that a record will end up in the final reservoir given it is in a sub-reservoir is: s/(2s) = 1/2.

It follows that the probability any given record will end up in the final reservoir is: 2s/n * 1/2 = s/n

Weighted Reservoir Sampling Variation2

This is sorta tricky. Pavlos S. Efraimidis figured out the solution in 2005 in a paper titled Weighted Random Sampling with a Reservoir. It works similarly to the assigning a random number solution above.

Ref

Footnotes:

Author: Shi Shougang

Created: 2015-03-05 Thu 23:21

Emacs 24.3.1 (Org mode 8.2.10)

Validate